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The inference of gene-regulatory networks from high-throughput gene-expression data is one of 
the major challenges in systems biology. This paper aims at analysing and comparing two different 
algorithmic approaches. The first approach uses pairwise correlations between regulated and regu- 
' lating genes, the second one uses message-passing techniques for inferring activating and inhibiting 

\ regulatory interactions. The performance of these two algorithms can be analysed theoretically on 

^s^j . well-defined test sets, using tools from the statistical physics of disordered systems like the replica 

method. We find that the second algorithm outperforms the first one since it takes into account 
collective effects of multiple regulators. 
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PACS numbers: 87.16.Yc Regulatory chemical networks, 02.50.Tt Inference methods, 05.20.-y Classical 
statistical mechanics 



I. INTRODUCTION 



O: 

According to the central dogma of molecular biology, there is a directed information flux from genes via mRNA 
to proteins whose aminoacid sequence is coded in the genes. This simplified view leads, however, to a paradoxical 
situation: All cells in a multicellular organism carry the same genetic information, i.e., the same DNA sequence, 
but they differ largely in their protein content which is responsible for the different functionality of, e.g., neurons, 
i ^ , '■ liver cells, skin cells etc. The crucial process which allows for cell differentiation is gene regulation as sketched in 
Fig. [TJ The transcription of a gene to mRNA is achieved by RNA polymerase (POL) or, in higher organisms, a protein 
complex containing polymerase. The binding of this polymerase to the DNA in the so-called promoter site depends, 
however, on the presence or absence of other proteins, the transcription factors, which have binding sites close to the 
promoter site. These transcription factors are themselves proteins, so they are coded for in genes in other regions 
of the DNA. We say that there is a gene- regulatory interaction from genes 1 and 2 to gene 0. Since the expression 
of genes 1 and 2 may be regulated by further transcription factors, the set of all such interaction forms a complex 
gene-regulatory network (GRN). 
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FIG. 1: Cartoon of transcriptional regulation of gene by two the transcription factors protein 1 and protein 2. 

If we knew the gene-regulatory network of an organism, it would be an interesting question to infer the dynamical 
behaviour and stationary points of these networks to understand the process of cell development and differentiation. In 
addition, this knowledge would allow us to numerically predict the outcome of genetic interventions, and eventually 
to identify new drug targets for medical treatment. It is, however, a very complicated, time-consuming and cost- 
intensive task to experimentally determine the detailed structure of such networks using techniques like gene-knockout, 
chromatin immuno-precipitation (ChIP) etc. In fact, genome-wide networks are only know for simple organisms like 
E. coli and baker's yeast 0,0] , whereas for higher organisms only few functional modules are known, see e.g. 0,0]. 

On the other hand, today it is relatively easy to monitor the gene expression on a genome-wide scale, the most 
important experimental tools here are microarrays (DNA-chips) which measure the simultaneous abundance of tens 
of thousands of mRNAs, i.e. of the products of the transcription process, which later on get translated into proteins. 
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It seems therefore an enormously interesting task to infer gene-regulatory interactions from expression data using 
bio-informatics means. Even if such an approach cannot be expected to replace a direct experimental determination 
of such interactions, it may well serve as a guide to design efficient and focused experiments. 

In the literature, various inference algorithms are discussed. A first class deals with relevance networks, measuring 
the correlation resp. mutual information of the expression of gene pairs 0,3- This approach obviously may include 
also indirect interactions, since also second or third neighbours in a regulatory network may be correlated. This is 
taken into account in the algorithm ARACNe 0, [l(| , which uses information-theoretic arguments to prune all those 
links which can be explained by indirect interactions. In Bayesian and dynamic Bayesian networks [ill [l2l [T3L [l4j . 
also the directed structure of causal dependencies between gene expression levels is inferred: A global scoring function 
is assigned to arbitrary networks, and the highest scoring network is considered to be the best possible inferred 
network. Unfortunately, finding this network is an NP-hard task, it can reasonably accessed only using heuristic 
tools (greedy algorithms with restarts, simulated annealing...) which are likely to get trapped in local extrema of the 
scoring function. A somewhat similar approach are probabilistic Boolean networks [l5l flq | which exhaustively scan 
all gene pairs or triplets in order to see in how far they determine the expression level of an output gene. Even if being 
polynomial, this algorithm is restricted to relatively few genes and regulatory functions with at most three inputs. 

The inference of gene-regulatory networks is unfortunately plagued by serious problems based on the quantity and 
the quality of available data. Some of the most relevant limitations are listed below: 

(i) The number M of available expression patterns is in general considerably smaller than the number N of measured 
genes. 

(ii) The available information is, in general, incomplete. The expression of some relevant genes may not be recorded, 
and external conditions corresponding, e.g., to nutrient, mineral, and thermal conditions are not given. Micro- 
arrays measure the abundance of mRNA, whereas gene regulation works via the binding of the corresponding 
proteins (transcription factors) to the regulatory regions on the DNA. Protein and RNA concentrations are, 
however, not in a simple one-to-one correspondence. 

(Hi) The data are noisy. This concerns biological noise due to the stochastic nature of underlying molecular processes, 
as well as the considerable experimental noise existing in current high-throughput techniques. 

(iv) Micro-arrays do not measure the expression profiles of single cells, but of bunches of similar cells. This averaging 
procedure may hide the precise character of the regulatory processes taking place in the single cells. 

(v) Non-transcriptional control mechanisms (chromatin remodelling, small RNAs etc.) cannot be taken into account 
in expression based algorithms. 

(vi) A last problem is a conceptual one: Searching for dependencies between the expression levels of genes does not 
automatically imply their direct physical interaction in transcriptional regulation. To give an example, strongly 
coregulated genes may appear as good predictors of each other, without having any direct interaction. 

The listed points obviously lead to a limited predictability of even the most sophisticated algorithms. It is therefore 
very important to test various algorithms on the basis of well described data sets containing some or all of the 
before-mentioned problems: Only a critical discussion on artificial data sets allows for a sensible interpretation of the 
algorithmic outcomes when run on biological data. 

In this paper, we present a recently introduced statistical-physics motivated inference algorithm build on a message- 
passing technique which, in a sense, is equivalent to the Bethe-Peierls approximation of statistical physics [l7l fla]. 
We further use statistical-physics tools, more precisely the replica method developed in the theory of disordered 
systems and spin glasses, to characterise the performance of this algorithm on artificially generated data, and to 
compare it to relevance resp. ARACNe networks. We will show that the message-passing technique is able to take 
into account also collective effects in the common action of various transcription factors, and therefore it outperforms 
local pair-correlation based algorithms. 

The structure of the paper is as follows: In the following section, we formalise the network inference problem. In 
the third section, we review two types of inference algorithms, namely pair-correlation based methods and a message- 
passing technique. Than we introduce a generator for artificial data and perform first numerical tests. Sec. fVl contains 
a statistical-mechanics approach to the performance of both algorithms, and a careful analysis of their behaviour 
under various (artificial data) conditions. At the end, we conclude this work, and discuss extensions of the message 
passing techniques which are necessary for a successful application to biological data. 
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II. THE NETWORK INFERENCE PROBLEM 



Let us first formalise the network-inference problem. Given are stationary points of the regulatory network, or gene- 
expression profiles (Note: Temporary expression data can be learnt with a modification of the presented algorithm. 
Here we concentrate fully on static data.) They can be written in the following form 

x£, i = 0,...,N, fi = l,...,M (1) 

with i enumerating the considered genes, and /i the different expression patterns (profiles). In this work, we assume 
the expression values to be binarized, x^ — — 1 signifies that gene % has very low expression in patterns fx, whereas a 
high expression is denoted by x^ = 1. Note that we use an Ising spin notation instead of a more standard Boolean 
one, but this choice will render the technical presentation better accessible. Such a binary representation seems to be 
oversimplified since gene expression is characterised by mRNA concentrations, but it is applied successfully in various 
biological examples [a, [a , lla ] . It is expected to work pretty well since gene-regulatory functions show frequently a 
switch like behaviour [191 ]. The simplest examples, repression and activation by a single TF, are illustrated in Fig. [5] 




Gene 1 Gene 2 



FIG. 2: Schematic representation of transcriptional repression (left) and activation (right). Repression works via the competitive 
binding of POL and the TF to overlapping binding sites, wheres activation uses a short-range attraction between POL and TF 
when attached to close-by binding sites. 



At this level, the inference problem can be considered to factorise over the output variables: We can first reconstruct 
the regulators for variable 0, then for variable 1 etc. until we reach variable N. In our presentation we concentrate 
therefore to one of these subproblems: Without loss of generality, we are interested in the question in how far and in 
which way gene is regulated by the genes 1,...,N. We can distinguish two subsequent subproblems 

• Topology: Which genes do influence gene 0, and which other genes do not influence gene 0. This result of this 
question are directed links from the regulating genes to the regulated gene 0. 

• Functionality: How do these genes influence the expression of 0? Here we want to restrict this question to the 
simplest types of interactions, classifying the input genes 1,...,N as activators, repressors or irrelevant for gene 
0. 

Note that the factorised approach may be oversimplified if experimental noise is considered. Here we are, however, 
not interested in giving a biologically realistic modelling, but in a theoretical analysis of the proposed message-passing 
algorithm. We therefore keep this assumption since it allows a more elegant and clear presentation of the results. 
As mentioned, we aim at classifying the directed interactions from gene i to gene by a ternary variable, 

{—1 if gene i represses the expression of gene 
if gene i does not regulate gene (2) 
1 if gene i activates the expression of gene 

Again, this classification is oversimplified with respect to biological reality. First, it does not include the fact that 
there may be strong and weak activators or inhibitors, all of them are just characterised by one single variable. Second, 
there may be interactions where a gene changes its activating or inhibiting character in the presence of other genes. To 
give an example, the XOR function of two variables cannot be represented within the above-mentioned classification. 

It would be possible to overcome this simplification by introducing more complicated variables. On the other hand, 
this would increase the number and complexity of the model-parameters, and consequently the risk of overfitting. 
The task of extracting a good predictor from few noisy data requires the model to depend on as few paramters as 
possible. 
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To infer this interaction vector, we have also to set up a minimal functional model for the joint interaction of all 
relevant input genes. Having classified the input genes by this ternary variable, we can sum up the total influence of 
an expression pattern x = (xi, ...,xn) on gene 0. If it overcomes some threshold 9 (which, for clarity of notation, we 
set to zero in the following), the expression of gene is expected to be enhanced, below this threshold the expression 
level of gene is repressed. At the Boolean level, we can therefore model the regulatory interaction by the threshold 
function 




x = sign > JiXi . (3) 



Note that, slightly more than one decade ago, this function has found considerable attention in statistical physics in 
the context of the theory of artificial neural networks 0, HH, it describes a diluted perceptron. In fact, the technical 
approach presented below consists of an adaption of the well-known Gardner calculus of perceptron learning [22j | to 
the gene-regulatory problem. 

Using this model, we are looking for classification vectors J — ( Ji, Jn) which are as compatible as possible with 
the input data x^, fi = 1, ...,M. We therefore introduce a Hamiltonian 

M I N \ 

w(j) = E e -<-E J ^n ( 4 ) 

M=l V i=l / 

which counts the number of patterns whose output contradicts the model ([3]) . The 6 symbolises the step function, 
which jumps from zero for negative arguments to one for non-negative arguments. One aim of the algorithm will be 
to minimise this Hamiltonian. 

To get reasonable results, we will also implement another biologically motivated constraint. Gene regulatory 
networks are diluted, i.e. the number of transcription factors for one gene is generally restricted to a small number. In 
the best known regulatory network, the one of bakers yeast, one finds evidence that the number of genes regulating 
commonly one gene is distributed according to a narrow exponential distribution, which can be explained by physical 
constraints for the positioning of transcription factor binding sites. To take into account this diluted character, we 
will also control the number 

N 

N eff (J) = J2(^-^o) (5) 

i=l 

of effectively present, i.e., non-zero links via a conjugated external field. This a priori bias towards diluted graphs 
helps again to avoid overfitting of the data, since it reduced the entropy of the search space. 

This idea is closely related to methods of regression and graphical-model selection using ^i-paramctcr rcgularization 
discussed in the machine-learning literature. The original idea was developped by Tibshirany for the case of sparse 
regression in linear models [23| , but it can be extended to more general convex optimization problems 0, HH, [26|, [27| • 
In all these cases, the optimization of the cost function is restricted by the £i-norm | Jj| of the model parameters 
Ji. The cusp- like structure of this constraint in Ji = forces a part of the parameters to be exactly zero. In the 
presence of sufficient data this method can be shown to asymptotically reproduce the correct topological structure of 
the underlying model [28| . In principle, one could use any £„-norm \ J%\ n f° r regularizing parameters. However, 
only for < n < 1 some of the optimal parameters assume values equal to zero; for 1 < n the constraint is convex, 
thus it can be added to the cost function without destroying its convexity. Only the £i-norm fullfils both conditions. 

Our case is, however, different from these examples. The discrete nature of the model parameters renders the model 
a priori NP-hard, and convex optimization cannot be applied. Furthermore, we are interested in ensembles of possible 
networks rather than a single network as predicted by convex optimization. 



III. INFERENCE ALGORITHMS 



In this section, we present two classes of inference algorithms. The first one is based on pairwise correlations between 
the output variable xq and inputs Xi. This correlation can be measured in different ways, two of the most important 
measures are the connected correlation coefficient (or its normalised version, the Pearson correlation coefficient), and 
the mutual pairwise information of the two variables. Inference algorithms based on this measure are wide spread, 
in their simplest version they result in the so-called co-expression or relevance network. Such networks normally 
contain much more links than the actual regulatory interaction between transcription factors and regulated genes: 
Also second neighbours may still be considerably correlated. This is taken into account in a very clever way in 
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the algorithm ARACNE which eliminates all those links which, from an information theoretic point of view, can 
be explained via a two-step interaction with an intermediate variable. This elimination step increases strongly the 
precision of the algorithm since it reduces the number of false positives, but it decreases also its sensitivity cancelling 
many weak pair interactions. 

The second method is the central one of our paper. It is based on a message-passing algorithm (more precisely a 
belief-propagation algorithm) which characterises the statistical properties of the set of all potential coupling vectors, 
weighted with respect to Hamiltonian (j4|) and the number of relevant links (|5|). We will see that this method is able 
to - at least partially - recognise the collective effects between different inputs in the data-generating rule (p?0)) . It 
therefore goes beyond simple two-point correlations, at the cost of being related to a more specific model of gene 
regulation. 

Both algorithms will output a ranking of all potential input variables 1, N according to the considered significance 
measure, going from the most significant genes to the most insignificant ones. The method itself has no intrinsic 
criterion where to cut this list. It his, however, possible to use suitable data randomisation procedures which give 
an indication at which level the significance value of an input may be explained by a random input-output relation 
(more precisely one may assign a p- value giving the probability that the significance value or a larger one comes from 
a random input-output relation) . 



A. Relevance networks 

Let us first explain pair-correlation based methods, like relevance networks @, |8| . They are based on measuring the 
correlation between the output variable and each input variable i G {1, ...,N}. The central quantities hereby are 
the joint distribution p{xq, x{) of the two variables, as estimated from the data via 

1 M 

p(x ,Xi) ~ 77 X, S ( x °' x o) S ( x i> x i) ' ( 6 ) 

and its marginals 

1 M 

P(«o) = ^p(x ,Xi) ~ jj^2S(xo,Xo) 

Xi — 1 

1 M 

P{x%) = ^p{x Q ,Xi) ~ —^2,5{xi,x1). (7) 

x ' fJ,= l 

The central question is, in how far the two variables are correlated, i.e. in how far the joint distribution is different 
from the factorised distribution p(xo)p(x\). This can be measured either via the connected correlation coefficient, 

C 0i = X \p(x ,Xi) -p(x )p(xi)] x Q Xi , (8) 

or via the mutual information, 

hi = y ] p{x ,x l )log 7"T7\ ■ ( 9 ) 



E , \ -i p(x ,Xi 



Note that Eq. is equivalent to the standard definition of the connected correlation, but the present formulation 
brings to evidence the distance measure between the joint and the factorised probabilities. The mutual information 
is also known as the Kullback-Leibler divergence between p(x$, X4) and p(xo)p(xi). It equals zero if and only if both 
are equal, i.e. if xq and Xi are statistically independent, and it is positive otherwise. 

These numbers (or, in the case of the correlation coefficient its absolute value) can be ordered and give thus a 
ranking of the inputs according to their statistical correlation with the output. One can expect the most significant 
correlations to come from direct interactions, whereas weak correlations are a sign for a non-existing link in the 
data-generating system (model or experiment). 

Note that this method is not able to detect the directed character of gene regulatory interactions, since the very 
definition in Eq. (jHJ) is symmetric against exchange of variables and i. This is a major drawback of correlation based 
methods, and can be eliminated using the algorithm proposed in the next subsection. 
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B. Belief propagation for network inference 



In this section we present our message-passing algorithm for network inference. It has two major differences to the 
correlation based method presented before. On one hand, it is model based (cf. Eq. ([3])) and therefore may fail in 
situations where this model is not sufficiently well-adapted. On the other hand, this model-based approach allows 
to take into account the joint action of all potential regulator variables. Therefore it may well go beyond the simple 
two-point correlations used in the previous section. 

The basic step is the introduction of a global weight function 



W(J) = exp [~(3H{J) - hN eff {J)} 



(10) 



weighing all candidate coupling vectors J (i.e., all potential regulatory networks) with respect to their energy 1~C(J) 
and the number N e ff(J) of non-zero coupling components, sec Eqs. (|4I5[) for the definitions. This weight depends 
on two not yet specified parameters: The formal inverse temperature (3 controls the energy. In the limit (3 — > oo the 
weight concentrates in the global minima of 7i. The formal external field h controls, on the other hand, the density 
of non-zero entries in J. Due to the biologically reasonable request for diluted J, we expect this field to assume 
high values. There is no obvious strategy of fixing these two parameters, but we will describe a reasonable heuristic 
strategy. 

Note that the data represent the quenched disorder, whereas the N components of the vector J are the (ternary) 
degrees of freedom. The primary algorithmic aim here is to determine the marginal single-site distributions 



{^•6{0,±l};j^} 



(11) 



where the normalisation constant in this equation is, as usually, given by the partition function. A direct calculation 
of these marginals is a computationally intractable task, since Eq. (|11|) contains the sum over N — 1 ternary degrees 
of freedom, i.e. we have to sum up S 1 ^^ 1 terms. It is therefore necessary to find some - possibly heuristic - tools which 
render this task efficiently solvable. 

Here we use the idea of message-passing (MP) techniques. These were initially created for problems with sparse 
constraints, i.e., variables have to be contained in few constraints, and constraints have to contain few variables. 
Typical examples for such problems are random 3-SAT and colourings or vertex covers on finite-connectivity random 
graphs. Here we are in the opposite situation: Each variable Jt is contained in M constraints given by the expression 
patterns a?*, /x = 1, M, and each constraint contains all ./V variables. Recently the applicability of BP has, however, 
been verified in a number of cases ranging from information theory to perceptron learning [29L l30l . l3lj . In [PA [l8| we 
have adopted these techniques to the specific needs of our network-inference problem. 

The BP equations can be easily written down. Each constraint sends a message to each variable, and each variable 
sends a message to each constraint: 



-h\Ji\ 



Hp- 



, exp ■ 

{J j, j¥i} 



i(Ji) 



-pe 



(12) 



The messages have an intuitive interpretation. The variable sends the information to the constraint: "If you were 
not present, I would have the statistical properties given by Pj^ jU (J,)." The constraint, on the other hand, sends a 
message to the variable: "Seen the behaviour of the other variables, please behave according to p M _>i(Ji) in order to 
satisfy me." Note that there are 2MN of these messages, which form a self-consistently closed set of equations. The 
true marginal distribution can than be estimated from the messages sent by all constraints, 



(13) 



From a computational point of view, we still have not gained anything. The second of Eqs. (fT2")l still contains the 
exponential sum over all but one Jj. This problem can be solved by the following observation: The dependence on 
the Ji enters via J2 k JkX^ and via the factorized distribution Pj^n(Jj)- F° r N S> 1 we can apply the central 
limit theorem and replace the sum over the N — 1 couplings by a single Gaussian integral, for finite N it will be used 
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as a computationally efficient approximation. We replace the second line of Eqs. (JT2]) by 

dh | (h — hi, 



Pn->i(Ji) oc 



with 



/27TA, 



hfi — >i 



■ exp ■ 



£< 



2A 2 ■ 



(14) 



2 



(15) 



where Oj^p denotes the average over Pj^^(Jj). Due to the step-like shape of the Heaviside function, the right- 
hand site in Eq. (|14[) can be expressed via the sum of two error functions, which again leads to a faster numerical 
implementation. 

Note that, in contrast to what happens in most combinatorial optimisation problems (and, in particular, also in 
Bayesian and dynamic Bayesian network learning), we do not have the aim to construct one single instance of a 
"good" coupling vector J, because this vector may be quite different from the original data-generating vector. We are 
instead interested in characterising the ensemble of all sets of vectors. More precisely, the marginal probability Pi(Ji) 
gives us a confidence measure for the hypothesis that the output variable xo actually depends on Xi : It measures the 
fraction of all suitable coupling vectors J (as weighted by Eq. (TIT)]) ) where the entry i takes value Ji. We can therefore 
base a global ranking of all potential couplings in the probability (1 — Pi (Ji = 0)) of having a non-zero coupling (or 
alternatively on the average coupling (Jj)j). 

Having calculated the messages and the marginal distributions, we may determine the energy of the average coupling 
vector 




N 



<# ■£<•*: 



(16) 



and the Bethe entropy 



M 



N 



S = ^2S fi -(N-l)J2S i 

/i— 1 i— 1 

characterising the number of "good" coupling vectors. In the last expression, the site entropy Si is given by 

Si = - J2 lnP i( J i) . 



Ji=-X,0,l 



and the pattern entropy 



S» = -£W) 



(17) 



(18) 



(19) 



exp ■ 



-pe 



4 ■ £ J k3 



can be calculated in analogy to p^—ti via a Gaussian approximation of the sum over J. 

So far, we did not fix the parameters j3 and h controlling the energy and the dilution of the inferred vectors. A 
simple heuristic strategy is the following: Fix the number N e tf(J ) externally to some desired value, and start at 
high temperature and initial h given by N e ff = 2coshft,/(l + 2cosh/i). Update the messages starting from some 
random initialisation, and calculate the energy of the average coupling vector and the Bethe entropy. Now decrease 
the temperature gradually towards the zero-entropy (or zero energy), adapting h such that N e ff{J ) remains close 
to the desired value. This strategy can be repeated for various values of N e ff. In practical applications, also an 
optimal N e ff can be determined by first dividing the data set into a training and a test set, calculating the marginal 
probabilities on the training set, and minimising the corresponding prediction error of the average coupling vector on 
the test set. In the case of artificial data coming from a known coupling vector, we can train on all data and compare 
directly with the correct vector. 
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IV. TESTS ON ARTIFICIAL DATA 

Before coming to a theoretical analysis of the algorithmic performance, we make some numerical test to get a first 
impression of the predictive power of the two algorithms. 

A. The data generator 

In order to check and compare the performance of the algorithms presented below, we have to introduce a data 
generator. It allows to create data sets with well-controlled properties, and the knowledge about the data-generating 
rule allows to compare the results of the inference algorithm to the true rule. 

In this work, we concentrate on data generated by a simple threshold function with couplings JP, 

a%=f&gafej?x!t + ifj (20) 

for all fi — 1, M. The inputs x M = (x^, x^) have independently distributed binary entries. With respect to the 
ternary classification into activators, repressors and non-interacting genes, this rule has two major differences 

• Heterogeneity: The couplings Ji may take values which are different from 0, ±1. This allows to introduce regu- 
latory variables of different interaction strengths, i.e. in particular weak and strong activators resp. repressors. 
In this work, we will study in particular the case where Ji € {0, ±1,±2}. The couplings itself will be drawn 
randomly and independently with respect to a distribution 

p(jf) = (l-h- k 2 )5(jf) + ^[6{J* + 1) + £(J° - 1)] + ^(JP + 2) + 5(J° - 2)] (21) 

Generalisations will be straight-forward, but they do not alter at all the conclusions of this work. Note that real 
gene-regulatory networks are sparse, i.e. we have to work in the regime ki r 2 *C 1. In the statistical mechanics 
calculations presented below, we will consider the thermodynamic limit N — > oo. At a first glance it seems 
to be reasonable to assume the usual scaling of finite-connectivity graphs, ki t 2 = 0(N~ 1 ), for the coupling 
probabilities. Due to technical reasons we will, however, concentrate on the scaling kx,2 = 0(1) "C 1. This 
choice will be justified by comparison with numerical data. 

• Noise: In biological data, there are various kinds of noise. The first one is biological noise resulting from the 
stochastic nature of cellular processes. The second and more annoying one is experimental noise which is strongly 
present in most high-throughput data. In addition, biological data are normally incomplete in the sense, that 
not all existing gene expression levels are measured, that nutrient or mineral concentrations, temperature and 
other external factors are not recorded etc. In our data generator, we have included a simple additive noise term 
rf 1 which we assume to be independent for various patterns \i. More precisely we assume it to be Gaussian with 

rf = 

WW = 7^<W (22) 

for all (j,, v G {1, M}. Note that the scaling of the variance with yN makes the noise to be of the same order 
of magnitude as the signal Jfx^. For the special case 7 = k\ + 4fc 2 , both noise and signal have exactly the 
same statistical properties. 

Note that this data generation rule is more complex than the inference rule ©. This will result in non- feasible 
data, i.e., a non-zero "energy" Tl(J) even for the best inferred couplings J. The best possible result would be 
Ji — signjP (here with sign(0) = 0) since this would correctly assign the attributes activating, repressing, irrelevant 
to the couplings. Due to the presence of noise as well as due to the finite amount of data available, such a perfect 
prediction will be impossible. We therefore introduce the following notation 

JP = J t = 

J<V J l = 

JP = J t ? 
JP ^ J t ? 



true negative (TN) 

false negative (FN) , . 

false positive (FP) ^ ' 

true positive (TP) 
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It can be refined taking also into account the relative sign of the original and the inferred coupling, cf. the above 
discussion about topology and functionality. One of the aims is to predict a fraction of all couplings with high 
precision, i.e. to have an as high as possible number of TP with a low number of FP. The quality measure we use will 
be the confrontation of the recall, or sensitivity, 

RC = ^ZZ (24) 

N TP + N FN 

and of the precision, or specificity, 

PR = TP__ 

N T p + N FP ' v ' 

The recall describes the fraction of all existing non-zero couplings which are predicted by the algorithm, whereas the 
precision tells us which fraction of all predicted links is actually present in the data generator. 



1. Feasible data 



As a first test, we have run BP on data sets with = 7 = 0. In this case, also the couplings of the data generator 
function assume only the three values Jf S {0,±1}, and noise is absent. This means that the data set is feasible, 
the model ([3]) on which we base our inference is able to reproduce the data without errors. We therefore work 
automatically at zero formal temperature, and thus at zero energy. 

The difficulty of the test problem comes from the fact that we are using an extremely sparse coupling vector: Only 
3 out of N = 600 input variables are coupled to the output via a non-zero coupling J?. We also present relatively 
small data sets with M < 70. Is BP able to recognise the three important input variables? 
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FIG. 3: Network inference from a feasible data set: Displayed are the overlap between the generator and the BP output as a 
function of the number of presented patterns, for various values of z = e h . The data generator has iV = 600 inputs, but only 3 
are non-zero. Curves correspond to one single test sample. 

First, we have done experiments without the exterior diluting held h. The results are given in Fig. 02 where the 
quality of the prediction is measured by the overlap 

1 N 



10 



This overlap takes for all considered numbers M of presented patterns only very small values, the algorithm is not 
able to recognise the relevant inputs. The situation changes drastically, if we add the diluting field h. We find, at 
some /j-dependent point, a discontinuous jump from a very bad BP solution with small overlap to a perfectly polarised 
solution with q = 1. The algorithm has perfectly recognised the relevant three inputs, and thus perfectly reconstructed 
the data generating rule! The number of needed patterns decreases with h, and can be as small as M = 35 for large 
fields. 

For even larger fields, the algorithm runs into problems: The dilution becomes as important for the algorithm as 
the energy constraints, and the BP solution polarises completely to the vector = 0. Best performance is obtained 
for fields just below this point. 

2. Unfeasible data 

The success of BP for feasible data sets encourages us to go ahead to more involved cases, i.e., cases with hetero- 
geneities in the coupling strength (ki,k2 ^ 0) and possibly with noise (7 ^ 0). Here we give a first example of a 
data generator with TV = 600 inputs, but only 30 of them have a non-zero coupling to the output - 15 of them with 
an absolute value equal to two, 15 with absolute value one {k\ = k 2 = 0.025). We generate M — 300 patterns and 
apply both inference algorithms. The resulting rankings can be cut at arbitrary points, such that we can measure a 
recall-vs. -precision curve as given in Fig. |4j 




n I 1 I 1 I 1 I 1 I 1 

0.2 0.4 0.6 0.8 1 

RC 



FIG. 4: Precision versus recall for inference with BP (green line) and via pairwise mutual information (red line). The parameters 
of the model are N = 600, M = 300 and kx = k% = 0.025, corresponding to 15 strong links ( J° = ±2) and 15 weak links 
(J = ±1), noise is absent. The dilution value of BP is chosen to be slightly smaller than the initial dilution, n e ff = 0.037. 

We find that both algorithms start at precision one for the highest ranking inputs. The algorithm based on mutual 
information includes, however, the first false positive after 8 true positives, whereas BP is able to find the first 14 true 
positives before including the first erroneous link. Also later on the performance of BP remains well above the one 
of the MI. Note that most of the initially recognised links are strong activators or repressors, only one of the first 14 
true positives in BP is an activator with jf = 1. Stronger signals are obviously detected first. 

Even if not being perfect, it is to be noted that both algorithms perform much better than a random ranking which 
would fluctuate around an average precision of only PR = k\ + hi — 0.05, as marked by the blue line in the figure. 
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V. A STATISTICAL ANALYSIS OF THE ALGORITHMIC PERFORMANCE 



Can we explain this behaviour analytically? Can we understand the discontinuous transition from very bad to 
perfect inference in the case of a feasible data set? Can we understand the relative behaviour of BP compared to the 
different pair-correlation based methods? 

In the following we will first calculate exactly the precision-versus-recall curve for the correlation coefficient and 
the mutual information, and than we go to an analysis of the BP performance via a statistical mechanics analysis of 
the Gibbs weight given in Eq. (|10|) . In both cases, we are mainly interested in the thermodynamic limit N — > oo. In 
this limit, we also consider M to diverge, with a = M/N being asymptotically constant. The probabilities ki and k% 
for non-trivial couplings in the data generator are considered to be finite, i.e. non-trivial couplings constitute a finite 
fraction of all entries of J°. 



A. Pair correlation and mutual information 



Let us recall the estimate of the joint distribution of the values of variable and variable i as calculated from the 
data. Using Eq. ©, and plugging in Eq. (12"01) for the value of the outputs Xq, we get 



1 M 

p(xo,Xi) = mY1 ^o, sign^ • 0* + ?f}) 6{xi,a%) 



(27) 



ij,=i 



Due to the random nature of the input patterns, this quantity itself is a random variable with some distribution 
P(p(xo, Xi)). Since the different patterns are statistically independent, the joint probability results as a sum over 
M = aN independently and identically distributed variables. According to the central limit theorem, P(j)(xq, Xi)) is 
thus Gaussian. Its mean can be calculated as 



p(x ,Xi) = -jj ^2 ${xo, signify ■ x 1 + r/ 1 ]) 5{x u x]) 



a% , . . . .x A 



= ^ E 5 



assign 



J°x, + J 3 X ] + V 1 



(28) 



where we have used that the input variables are independent and unbiased, and that the mean of all M contributions to 
the sum is identical. For calculating this sum, we use another time the central limit theorem, the sum y = ^Zj^j ^j x ] 
is Gaussian distributed with mean zero and second moment 



y 2 = (ki + Ak 2 + j)N + 0(1) =: A 2 



We thus get 



p(xo, 



= ; h - 



dy 



2?rA, 



-e-y'/^ 8(x ,aga[j9xi+y]) 



y/Qsl + Ak 2 + l)N 



(29) 



(30) 



with H(x) — ^^=e " 2 . We expand this function around zero, and find 



1 



p(x ,Xi) = - + 



Jfx Q Xi 



4 2^(^+4^+7)^ 
The calculation of the variance is now straight-forward. We have 

M — 1 



+ 0{N- 1 ) 



p(x ,x l ) 2 



p(x ,Xi) +—p(x ,Xi) 



(31) 



(32) 
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and consequently we find 



1 



[Ap(x ,Xi)} 2 = — p(x ,Xi) (l-p(x ,Xi)) 

-3/2 



+ 0{N- A ^). (33) 



Finally we need the correlation for different value pairs (xo,Xi) ^ (x ,2:Q. Following the the same strategy as before, 
we find 

Ap(x ,x % ) Ap(x> , x> % ) = + 0(N- 3 / 2 ) . (34) 

A side remark is necessary here. The covariance matrix of all four values of p(xQ,Xi) is degenerate, with a zero- 
eigenvector (1, 1, 1, 1). It is, however, clear that all four quantities cannot have a simple joint Gaussian distribution - 
given three, the fourth is determined by normalisation, ^2 Xo x . p(xq, Xi) = 1. The zero-eigenvector thus corresponds 
to the forbidden normalisation breaking fluctuations. 



1. Pair correlations 



Let us, however continue with the pair correlations between the output variable Xq and a potential regulator x^ 
We write 



C 0i = ^ P( X 0,Xi) 



X Xi 



p(l, 1) +p(-l, -1) - p(l, -1) - p(-l, 1) 

2p(l,l) + 2p(-l,-l)-l 



(35) 



where, in the last line, we have exploited the normalisation of p(xq, x\). Due to imperfect sampling in a finite data 
set, this is again a Gaussian random variable. Using Eq. (|31|) we find its mean to be 



Coi — 



2J? 



+ 0(N- 1 ) , 



(36) 



^^(fci + 4fc 2 + TpV 
whereas its variance can be calculated directly from Eqs. (|3"3"1) and (|34|>. 

(AC^=^- + 0(N- 3 / 2 ) 
aJS 

Both mean and variance scale as 0(N~ 1 ^ 2 ), introducing thus Co, = \/~N Coi we find its probability distribution 



(37) 



^/27r(fci + 4fc 2 + 7) 



conditioned to the value of the coupling J? between variables and i in the data generator (|20| . At this point, 
one recognises the potential and also the limitations of network inference via pair correlations: The mean value of 
the correlations of different variables is well separated by a term being proportional to the coupling strength, and in 
particular strong activators and inhibitors should be distinguishable from irrelevant variables. On the other hand, the 
variance of distribution (|38[) is of the same order of magnitude as the mean, i.e. the distributions for different values 
Jf overlap, and errors in the network inference are unavoidable. The only way to suppress errors is to augment the 
amount of data, the variance of the measured correlation values around its expectation decays as 1/ y/a. 

The measured pair correlations, or more precisely their absolute values, serve as a ranking for the inclusion of 
different variables into the reconstructed network. Let us therefore imagine that we introduce a cutoff 9 > for the 
absolute value of the rescaled correlations Cqi, i.e., for all i with |Coi I > $ we introduce a coupling, for all inputs with 
a smaller absolute correlation value we assume the coupling to be absent. The fraction of included couplings with 
given original J° is than found to be 



p ( |Coi| > e 1 j°) = J 6 dc Ql p(c 0i \J ) + J™ dc 0l p(c 0i \J ) 



H ^ 



2J? 



^27r(fci + 4fc 2 + 7) 



H 



2J? 



y /2n(k 1 + 4fc 2 + 7) 



(39) 
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For J° 7^ 0, the corresponding included couplings are true positives. For J° = 0, they are false positives. We thus 
can write up the recall and the precision of the inferred network, in dependence of the cutoff parameter 9: 



kiP[ \C 0l \>9\ ±l\+k 2 P[ \C 0l \>9 | ±2 
k\ + k 2 



ktP \Coi\ >0\ ± 1 + k 2 P[ \Coi\ > 9 I ±2 
PR{0) = r — ^ s '-7—. * v J f— r- (40) 

(i - h x - k 2 )p ( |c 0i | > e i oj + hp ( \c 0i \ >6 \ ± lj + k 2 p ( \Coi\ >e \ ±2) 

Changing the value of 9 from high to small values, we increase the recall from zero to one, at the cost of a decreasing 
precision. More details will be given below, in the comparison to the results of the massage-passing based inference 
algorithm. 

Note that we did not require the correct determination of the sign of the coupling, a generalisation to this case is, 
however, straight forward. For not too small a, it practically does not lead to recognisable differences, i.e. the waste 
majority all true positives is recognised due to the correct sign of the measured correlation. 



2. Mutual information 



In how far does the network reconstruction change if we go from correlations to the pairwise mutual information as 
given in Eq. |9|)? The latter has a nonlinear dependence on the pair correlations p{xq,x\), and consequently it does 
not have a Gaussian distribution. We may, however, use the fact that variables are only weakly correlated, and write 



p(x ,Xi) 



z(x Q ,Xi 



N 



(41) 



The newly introduced function z{xq,x\) contains all non-trivial correlations between xq and x±, and it is of O(N ). 
Note that at this point we consider Eq. (|4lT) to be an exact definition of z(xq, Xi), i.e. the latter contains all the lower- 
order corrections in N and not only the dominant term. Due to the normalisation of p it fulfils x- z ( x o, %i) = 0. 
We further on write for the two marginal single- variable distributions of p{xq, X{): 



P(Xi) 



1 z(xq) 

2 v^V 

1 z(Xj) 

2 ~ 



N 



Plugging this into the definition of the mutual information, we find 



'Oi 



E 



1 z(x ,Xi) 

4 



N 



+ 4 zfaxj) )_ log ( 1+ l^0\ _ b g f i + 2 



N 



N 



N 



(42) 



(43) 



Expanding the logarithm up to second order in the z, and using the normalisation to eliminate all linear terms, we 
find after some straight-forward algebra 



= |ki,i) + s(-i,-i)] 2 + o{ X 



r 2 

°oi 



O 



(N-i) 



(44) 



We thus conclude that, at least under the scaling limits considered in this work, the correlation coefficient and the 
mutual information carry the same information. A network reconstruction algorithm based on either of these two 
quantities thus leads to the same results. 



B. Characterising the set of coupling vectors 



In the last subsection, we have analytically characterised the performance of a network inference algorithm based 
on pairwise correlation measures. Is a similar comparison possible for the belief-propagation algorithm introduced in 
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Sec. IIII BP As already mentioned, the algorithm is based on the measure 

1 



P{J) 



Z(J3,h) 



exp 



{-pH(J)-hN e Jf(f)} 



(45) 



which assigns a weight to every coupling vector J. This weight depends on its performance on the data set {x^} as 
measured by the Hamiltonian H(J) defined in Eq. and on the effective number N e ff(J) of relevant inputs J; ^ 
as defined in Eq. ([5]). A positive inverse temperature (3 introduces a bias toward J with few errors on the data set, 
a positive h favours diluted coupling vectors. The aim of the algorithm was to determine the marginal distributions 
Pi(Ji) for single entries of the vector J, and use the probability P,(Jj ^ 0) that such a component becomes non-zero 
for ranking the input variables according to their relevance for the output. 

Note that in this way we do not work with a single reconstructed network, which may or may not be similar to the 
data generator, but with a probability measure over all feasible coupling vectors. 

To characterise the performance of this ranking procedure we have to calculate first the partition function 



Z(fl, h)= exp i-m{J) ~ hN e ff(J)\ 



(46) 



./e{o,±i} f 



From this we can calculate the average energy 



(H)=eN = -— logZGM) 



(47) 



and the average number of non-zero couplings 



(N eff )=n eff N = -— log Z(P,h) . 

The entropy characterising the number of statistically dominant coupling vectors is now given by 

S = sN = log(Z) + (3(H) + h(N eff ) . 



(48) 



(49) 



Besides these standard thermodynamic quantities, we also need the distributions P(J ^ | J°), i.e., the probability 
that a single coupling component is different from zero, given the corresponding coupling in the data generator. This 
distribution will allow us to identify true and false positives following a similar strategy as in the last section. The 
calculation necessary to obtain these thermodynamic quantities is a modified Gardner calculation as introduced in the 
theory of formal neuronal networks. This calculation follows a path beautifully reviewed in the text book by Engel 
and van den Broeck (2lj . we therefore skip all technical details and present only the major steps. 

A first observation is that the free energy — logZ(j3, h) depends explicitly on the realisation of the input data, 
i.e. on the x^, the noise r]^ and the coupling vector of the data generator. It is consequently a random variable 
which is, however, expected to be self-averaging: In the thermodynamic limit, its distribution is expected to become 
more and more concentrated around its average, and relative fluctuations are expected to disappear. We can therefore 
replace the actual free energy by its mean value — /3 _1 log Z(j3, h), where the overline denotes the joint average over 
af, rf and J°. Technically this can be achieved using the replica trick: We write the logarithm as 



Z n - 1 



log Z = lim 



(50) 



calculate the right-hand side first for positive integer n, introduce an analytical continuation to real n and perform 
the limit. The average over Z n for integer n can be performed easily since the n-th power can be interpreted as an 
ri-fold identical replication of the original system. 
We thus have to calculate the following: 



A' 



7(2 7 AT) 



{J a ;a=0,...,n} i=1 



=1 \J?\ 



A I 



exp -/?£E Q 

a—l i-i—l 



N 



X i 



i=l 



N 



(51) 
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Here, a — 1, n enumerates the n replicas of the system, whereas a — stands for the data generator. Following the 
path explained in [2l| . we can extract the interesting quantities under the assumption of replica-symmetry: 



2a / Dy H 



kPq — r 2 



-II 



n eff -q 



J 1 + {e-l 3 - l)H 



-I) 



s = -n e ffk - rf + -qq + 



Dx 5>(J°)log 

J° 



n e ff-q 

1 + 2e £+ « /2 cosh{f J° + xyfq~} 



(52) 



+2a / Dy H 



k°q — r 2 



log ( 1 + (e"' 3 - 1)H 



+ fl- 



ier f-q 



1 + {e-l 3 - l)H 



n^ff-q 



with the Gaussian normal distribution Dy = dy/\/2~jr e v / 2 , k° = k\ + 4&2 + 7 and the order parameters being 
determined via the self-consistent saddle-point equations 



l eff 



-(J a -J a ) 



rrOs 2p>+9/ 2 cosh{f J° + xVq} 

Dx > p (J ) f 

^ 1 + 2e fe +9/2 cosh{f J° + xy/q} 



Dx ]Tp(J°) 

J° 



2e k+q/2 smn {f J° + xVg} 
1 + 2e fe+< ?/ 2 cosh{f J° + 2;^} 



n 2 



iV 



(J 6 • J a ) 



afc°r 



1 + 2e fc +'j/ 2 cosh{f J° + xv^} 
k°qy 2 



■K(k°q - r 2 fl 2 
ak 

""v^K// - q) 3/2 

ak°r 
7r(fc°<j - r 2 ) 3 / 2 



dy y exp 



dy y exp 



2{k°q-r 2 ) 
ky 2 



log ( 1 + {e- 13 - l)H 
e-PH -y 



n eff 



neff-q 



dy y exp 



2{neff ~ q)j ^(e^-l)H[-y^= 
k°qy 2 



2{k°q 



log ( 1 + {e~ (i - 1)H 



-I) 



(53) 



The parameter n e f / measures the self-overlap of a typical coupling vector with itself, and thus the desired fraction of 
non-zero components, q measures the overlap between two different coupling vectors, r the overlap between a typical 
coupling vector and the coupling of the original data generating vector J^, and their conjugate parameters. All these 
parameters still depend on (3 and h. As explained in the algorithmic section, we fix them by requiring n e ff to be in 
some small given interval, and by increasing the temperature such that either the energy or the entropy go to zero, 
focusing in this way to the ground states at given n e f / . 

At this point, we are also able to express the probability P(J 7^ 0| J°) that a link is present in an inferred coupling 
vector, in dependence on its value in the data generator. We find that it takes the value 



P(JjLQ\J°) = 



2e fe+g/2 coshjf JO + 

1 + 2e^+«/ 2 cosh{f J° + xs/q} 



(54) 



with x being a random variable drawn from a Gaussian normal distribution. This random number represents the 
heterogeneity of the different inputs having the same value of the coupling in the data generator, which persists due to 
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the imperfect sampling at finite a. Accepting all those inputs having a value of P(J ^ 0| J°) larger than an arbitrary 
threshold € (0, 1), we find the fraction of accepted couplings to equal 



P [P(J J°) > 



H 



-H 



1 

71 



- VfJ° + acosh 



2e fc+<?72( 1 



^( + ^ + acosh^i^' 



(55) 



In strict analogy to Eqs. (I40j) . we can determine the recall and the precision of the inferred network, in dependence 
of the cutoff parameter </>: 



RC{<j>) = 



fciP [P(J + 0| ± 1) ><j>\ + fc 2 P [P(J ^ 0| ± 2) > 



fci + fc 2 
fciP [P(J j6 0| ± 1) > 



fc 2 P [P(J ^ 0| ± 2) > 



(1 - ki - fc 2 )P [P(J + 0|0) ></)]+ fciP [P{J ^ 0| ± 1) > 0] + fc 2 P [P(J £ 0| ± 2) ><}>} 



(56) 



Again, for a very strict acceptance criterion <p slightly below one, we start with a small recall, but high precision. 
Approaching zero with the cutoff, the recall becomes better and better, but the precision goes down. 



C. The algorithmic performance 

1. Feasible data and perfect reconstruction 




FIG. 5: Phase diagram for a feasible data set with ki = 0.005, &2 = 0, 7 = 0. The red line separates the phase space region 
with error-less solutions from a region where the ground-state energy is positive. The always existing perfect solution J — ,P 
is given by the green line. 

In Sec. IIV A II we have seen that, in the case of a feasible data set, a discontinuous jump from a region of bad to 
perfect network inference appears. This jump can be easily understood on the basis of the phase diagram displayed 
in Fig. [5J If we have a relatively small number of input patterns (a < 1.287 in the parameter choice of the figure), 
there is a large variety of J vectors perfectly reproducing the data set. The point is, however, that almost all of 
them are very dense, i.e., they have a large n e //. Without an external field favouring diluted couplings, the algorithm 
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automatically concentrates to the maximum entropy point at high n e ff (which we found numerically to be always 
at n e ff > 0.5), and the quality of the reconstructed network is very bad. Increasing the diluting external field, the 
value of n e ff decreases first continuously, remaining still in the region of positive entropy and zero energy. At a 
certain field we hit, however, the red line in Fig. [5] It marks the phase transition to a region where ground states 
have positive energy and zero entropy. Our algorithm is, however, forced to work at zero energy, it jumps therefore 
to the next diluted solution with e — - which is the perfect solution J = J°. The algorithm is thus able to perfectly 
infer the data generator as soon as we have a value of a = M/N being larger than the one of the intersection point 
of the phase boundary and the perfect solution line (More precisely, this point is a lower bound since BP stays a 
while at a metastable solution with negative entropy before jumping to the perfect solution). Using the parameters 
of the example in Sec. II V A H and Fig. we find that for perfect reconstruction we need to have a — M/N > 0.053, 
for N — 600 potential inputs this corresponds to at least 32 patterns. Note that this value is consistent with the 
single-sample experiment in Sec. IIV A II 

The existence of a first order transition underlines the necessity to introduce the diluting field h. Without the field, 
one would have to work at a > 1.287, which, in the case shown in Sec. IIV A~T) would correspond to more than 770 
patterns. With the use of the field, we have perfect generalisation already with slightly more than 32 patterns. This 
increase in predictive power is impressive, remembering in particular that the diluting field was biologically motivated 
by the fact that gene-regulatory networks are sparse. 



As already discussed before, the feasible situation is very unrealistic, and therefore only a first (but important) test 
case. In the following, we are going to investigate the behaviour for heterogeneous and possibly noisy data generators. 

In this section, we therefore consider data coming from a generator with couplings J® € {0,±1,±2}, but try to 
infer them on the basis of our ternary classification to activators, repressors and irrelevant inputs. The more complex 
nature of the data generator leads to the fact that no perfect zero-energy solution exists any more, and we have to go 
to finite energies (resp. finite temperatures) in the algorithm. As explained before, the parameters are fixed such that 
we work at various preselected dilutions, and the temperature is chosen such that we are in a corresponding ground 
state, i.e., we work at zero entropy. 

More precisely, in all the rest of the chapter we concentrate on the parameter selection ki = k^ = 0.025, i.e., 2.5% 
of all relevant links are weak activators or repressors, and other 2.5% are strong couplings. The remaining 95% are 
irrelevant inputs. 

Let us first concentrate a moment to the error-less case, 7 = 0, and investigate the influence of the number a of 
presented patterns per input bit. In Fig. [5] we have given the results for the precision-vs. -recall curves at various 
dilutions. If the density of the inferred network is set too high (n e ff = 0.1 in the figure), the performance is not very 
good, but it increases if we further dilute the inferred network. The best performance is obtained if the dilution of 
the inferred network is slightly stronger than the one of the data generator. The reason for this finding is very clear: 
In this way the algorithm keeps the strongest good signals, but cuts many of the strongest false positive signals - 
obviously at the cost of introducing some false negatives. A stronger diluting field would cut out more true positives 
becoming thereby false negatives, a weaker field would include more false positives which before were true negatives. 
A more formal reason can be seen looking to Eq. ()54[) : In this equation the effective signal due to a coupling J° is 
given by fJ, and it is distorted by the noise x-\fq, with x being normally distributed. The best inference is possible 
if diverse signals are maximally separable, i.e., if the noise-to-signal ratio y/q/f is minimal. In Fig. [3 we plot this 
ratio for various a, as a function of the effective dilution n e ff- The optimal value is found at n e ff < n eff : We also 
see that the position of the minimum moves toward fi^// for growing a. 

In Fig. [HI we have also included the corresponding curves for algorithms like ARACNe based on pairwise correlations 
- the message-passing approach takes into account collective effects and therefore clearly outperforms the local pair- 
correlation methods. 

In Fig. [8] we have compared the analytical findings with the performance of BP on single samples of size N = 600, 
averaged over 500 samples. Even if the algorithm includes a Gaussian approximation compared to the exact BP 
equations, the coincidence of both approaches is very good. 

Another striking finding of Fig. [5] is the strong improvement of the predictive quality with increasing data sets: For 
a — 0.1, the precision starts to decrease immediately with increasing recall, whereas it stays practically equal to one 
for higher values of a, cf. also Fig. [5] This observation can be quantified better by measuring the prediction error 



2. Unfeasible data and partial reconstruction 




(57) 




FIG. 6: Theoretical results of the performance of BP versus pair correlation based methods like ARACNE, for various values 
of a = M/N. 

which gives the area over the precision-vs. -recall curve. The prediction error is zero for a perfect predictor, and goes 
to 1 — ki — k2 — 0.95 for a completely random predictor. In Fig. [10] we have plotted the prediction error as a function 
of the size of the data set as given by a: The curves for both algorithm types start obviously at = 0.95 for empty 
data sets (cv — 0), without data there cannot be any prediction. The error than decreases exponentially fast with 
growing a, but we see that BP behaves much better than the curves corresponding to the mutual information. 

To complete our analysis, we have also included errors into the data generator, fixing all other parameters to a = 0.5 
and n e ff — 0.04. Other parameter values lead to qualitatively equivalent results. In Figs . fTTI and [T2l we have plotted 
the precision- vs. -recall curves and the prediction error for various values 7 of the noise strength. The predictive power 
of the algorithms obviously decreases with increasing noise level. Note, however, that even for 7 = k\ + 4fc2, where 
the statistical properties of the signal and of the noise are equivalent, a non-trivial prediction is possible. 

VI. CONCLUSION AND OUTLOOK 

In this paper, we have analysed the performance of two approaches to the identification of gene-regulatory interac- 
tions. Whereas the first method is based on a ranking of all gene pairs according to the measured correlations between 
their expression levels, the second method tries to include collective effects of activators and repressors coming from 
the joint action of multiple regulators. This second method, developed by the same authors in an earlier publication, 
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FIG. 7: Inverse signal-to-noise ratio: Best inference at given data-set size a is possible at the minimum of the curve. 



uses message-passing techniques (belief propagation) which can be understood as an algorithmic reinterpretation of 
the cavity method in spin-glass physics. 

Using a simple data generator, we achieved a theoretical analysis of the algorithmic performance. In the simpler 
case of pairwise correlations, this analysis is based on a straight-forward application of the central limit theorem. 
The more involved case of the second algorithm requires the use of other spin-glass techniques, more precisely of the 
replica method in order to perform a generalised Gardner calculation in the space of all candidate gene networks. We 
found that algorithms taking into account collective effects beyond pair-correlations work better. The intuitive reason 
for this finding is that the best joint prediction is not necessarily done by the set of the best individual predictors. 
These results are true even if the data generator is more complex than the inference algorithm. We tested explicitly 
the cases of heterogeneous regulation strengths and noisy data. 

There are some interesting directions to be followed. First of all, these algorithms are designed to be applied to 
real biological data. As already seen in [17], also BP can be used to extract sparse and predictive information from 
gene-expression data. Still there is space for substantial improvements by integration of biological knowledge which 
goes beyond a simple diluting field. One possibility is, e.g., the integration of sequence information on putative 
transcription factor binding sites via inhomogeneous diluting fields. 

One drawback in both the considered data generator and the derivation of the BP algorithm is that we neglected 
correlations between inputs. These correlations exist, however, in biological data, since also transcription factors may 
be coregulated. An interesting approach in this direction was recently presented by Kabashima [32| . and can be 
extended to our model. It would be interesting to see under which conditions correlations really change the behaviour 
of the algorithms. 

A last point is the application of the algorithm to other problems. In fact it gives more generally a sparse Bayesian 
classifier. Bioinformatical applications of such classifiers goes far beyond the inference of gene networks. One example 
we are currently analysing is the classification of expression patterns of cancer tissues according to their diagnosis 
(cancer vs. healthy tissues, different cancer subtypes), and the detection of predictive but sparse gene signatures for 
the different diagnoses [30l. [3^. 
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FIG. 8: Theoretical versus numerical results. The numerical results are produced at N = 600, and averaged over 500 realisations 
of the disorder. 
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FIG. 11: Precision versus recall of BP for various noise strengths 7 = 0.0, 0.025, 0.05, 0.075, 0.1, 0.125 (curves from right to 
left), ranging from no noise to equal signal and noise strengths. 



